Abstract. A statistical analysis of the texts of bacterial genomes has been
performed on 20 complete genomes taken from GenBank. The ranked word
distributions are well approximated by a logarithmic law. Results of the
absent-word investigation show the considerably nonrandom character of DNA
texts. In several genomes the autocorrelation function shows period-3
oscillations. Short-range autocorrelations are present for short (n = 3) words
and practically absent for longer words.
1 Introduction

DNA molecules are the main storage of information about any organism. They are
long sequences (linear or closed into a loop) contained in each cell of an
organism. DNA sequences are usually represented by a string of just four
letters (A, C, G, T), each corresponding to a definite type of nucleotide:
adenine, cytosine, guanine and thymine. These letters can form different
combinations. Some combinations in DNA texts are nonrandom: they reflect the
structure and function of DNA and proteins. Hence the question arises: what
regularities of such letter sequences correspond to known DNA properties?

Owing to modern automated techniques and new genome sequencing technologies,
the amount of DNA text data has grown greatly 0. The crucial question of
modern genomics is what kind of information can be extracted from these data.
To this end many statistical methods have been applied to, or even developed
for, DNA sequence analysis. Great success has been achieved in Hurst index
estimation for DNA sequences |p3|, [14], transition matrix analysis [14],
random walks [|[, the use of the mutual information function [16], [17],
detrended fluctuation analysis |t], linguistic methods |h| [20] pl|, |2^|, etc.
A large number of models emulating DNA have been constructed flq, [23].

Although some of the studies contradict each other, the presence of long-range
correlations and period-three oscillations in DNA sequences can be considered
statistically established. The 3 bp periodicity and the mutual information
function are now widely used to find exons in newly sequenced DNA
[ p4[ ^ [16]. Long-range correlations are discussed in connection with the
chromosomal organization of genomes ||.

Let us note that most statistical investigations have been performed on letter
(single-nucleotide) sequences. At the same time, elements of DNA structure such
as three-nucleotide (codon) protein coding and regulatory units such as
promoters, splice sites, enhancers and silencers are difficult to detect or
predict, or are detected with too low accuracy, by letter analysis alone. So it
seems necessary to consider regularities not only in letter sequences but also
among groups of letters (words). As a concrete example of the importance of
words one can take restriction enzyme recognition sites such as BglI:
GCCNNNNNGGC, where the five nucleotides between GCC and GGC may be any, or the
classical Pribnow-Gilbert blocks TTGACA and TATAAT, where, moreover, the
distance between the blocks can vary [p5|.

Some attempts at statistical analysis and classification of short (3-8
nucleotides) sequences have been made |2(| ^8|. This especially concerns
three-letter sequences (codons), since codons form, so to say, the amino acid
(20-letter) language. But comparing the results, one can see that they strongly
depend on the object under consideration ^2). Let us note that triplet analysis
is also insufficient to explain DNA structure: for example, it tells us nothing
about DNA conformational properties, interactions with proteins and
protein-RNA complexes, or the equilibrium between mutation and heredity, though
it is reasonable to suppose that this information is present in the DNA text as
well.

If we analyzed a single genome, we would obtain a result valid only for a very
narrow range of objects. If we studied the statistical properties of words for
drastically different organisms, we would obtain strongly differing results
[^7], ^8|. For this reason we decided to focus on bacterial genomes. On the one
hand, there are different kinds of bacteria, which, one can suppose, is
reflected in some differences between their genomes; on the other hand, one
kind of bacteria is certainly closer to another than to other groups such as
plants or viruses, not to mention higher organisms. So the goal of the present
paper is a comparative analysis of bacterial genomes in terms of words.

In Sec. 2 we present the distribution of word frequency versus rank, in analogy
to the Zipf analysis of natural languages Q, and compare the results with those
of linguistic DNA analysis J2f|. In Sec. 3 we consider the most frequent words
and those that almost never occur. Section 4 presents the results of the
autocorrelation analysis. Finally, conclusions based on the comparative
statistical analysis are given in Sec. 5.



2 Word frequency 

In analogy to the Zipf analysis of natural languages |5(| we study the
distribution of word frequency versus rank. To obtain the distribution we first
rank-order the total number of occurrences of each word and then plot their
relative values against rank. We have investigated words of length n from 1 to
7 nucleotides in 20 complete bacterial genomes from GenBank, whose sizes, short
names (used in this paper) and full names are presented in Table 1.

The sequence lengths range from 580074 bp (mgen) to 4639221 bp (ecoli). To
obtain the frequencies of occurrence we use the same sliding-window method as
[19] ^lj. In this approach a window of length n nucleotides (letters) is taken,
and the set of blocks (words) of size n is obtained by shifting the window one
letter at a time. We look for all $4^n$ possible words for each $n \in [1, 7]$.
If n < 5, the number of possible words is less than 300, which, generally
speaking, seems insufficient to state that the distribution obeys some law. For
this reason, only for $n \in [5, 7]$ do we claim that the distributions are
well approximated by the logarithmic law

$$f(r) = -a \cdot 10^{-4} \ln r$$

where r is the rank and f is the frequency of occurrence of a word (the
approximation accuracy ranges from 95.6% for the hpyl genome to 99.6% for the
synecho genome). The indexes a for each genome and $n \in [5, 7]$ are presented
in Table 2. Figure 1 shows the distributions for the tpal genome, which has the
smallest index, the mgen genome, which has the largest index, and the aquae
genome, an intermediate case.
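The counting and fitting procedure described above can be sketched in a few
lines of Python. This is a minimal illustration under stated assumptions, not
the authors' code: the function names are hypothetical, the fit is an ordinary
least-squares regression of frequency on ln r, and the intercept b is an
assumption added so that the regression is well defined (the text quotes only
the index a, and any scaling of a used in Table 2 is not applied here).

```python
import math
from collections import Counter


def count_words(seq, n):
    """Count overlapping words of length n (window shifted by one letter)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def ranked_frequencies(seq, n):
    """Relative word frequencies sorted from most to least frequent."""
    counts = count_words(seq, n)
    total = sum(counts.values())
    return sorted((c / total for c in counts.values()), reverse=True)


def fit_log_law(freqs):
    """Least-squares fit of f(r) = b - a * ln(r); returns (a, b).

    The intercept b is an assumption of this sketch. At least two ranks
    are required.
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(freqs) / len(freqs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, freqs))
    slope = sxy / sxx
    return -slope, mean_y - slope * mean_x


if __name__ == "__main__":
    toy = "ACGTACGTTTGACA" * 50  # toy sequence, not a real genome
    a, b = fit_log_law(ranked_frequencies(toy, 3))
    print(f"a = {a:.4f}, b = {b:.4f}")
```

Words that never occur are simply missing from the ranking here; characters
other than A, C, G, T (for example N) would be counted as letters and should be
filtered beforehand if they are present in the input.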

Let us note that the indexes do not depend on genome size.

Looking at Figure 1 in pl|, it is easy to see that the power-law approximation
holds only up to rank $10^3$, which is only a third of that graph; for the
other two thirds it clearly does not hold. Other works [19] also confirm that a
power law is not the best approximation of the ranked word distribution and
that DNA texts should not be considered as written in a language in the
linguistic sense.



3 Most and least frequently met words 

A DNA molecule consists of two strands whose letter sequences correspond to
each other according to the rule: opposite the letter A on one strand, the
letter T is situated on the other; opposite C, G; opposite G, C; and opposite
T, A. This is the complementarity property of DNA strands. Looking for the most
frequent words in the genome texts, we found that for $n \in [2, 6]$ they are
polyA fragments (AAA...A = $(A)_n$) or, taking into account the complementarity
of the sequences, polyT fragments ($(T)_n$) as well. Let us denote this as
$(A/T)_n$, where n is the length of the fragment. Table 3 presents the results
on the occurrence of polyA/T sequences for $n \in [2, 7]$, as well as the
genomes whose most frequent word is not $(A/T)_n$.
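The complementarity rule and the search for a dominant word can be illustrated
with a short Python sketch. The helper names are hypothetical and the input is
a toy string; ties between equally frequent words are broken by first
occurrence, which is a choice of this sketch rather than something stated in
the text.

```python
from collections import Counter

# Complementarity rule: A <-> T, C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Word read on the opposite strand (complement, then reverse)."""
    return word.translate(COMPLEMENT)[::-1]


def dominant_word(seq, n):
    """Most frequent overlapping word of length n in seq."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    return counts.most_common(1)[0][0]


def is_poly_at(word):
    """True if the word is (A)_n or (T)_n."""
    return set(word) in ({"A"}, {"T"})


if __name__ == "__main__":
    toy = "AAAATTTTAAAA"  # toy sequence, not a real genome
    w = dominant_word(toy, 2)
    print(w, is_poly_at(w), reverse_complement(w))
```

Because a polyA run on one strand is a polyT run on the other, counting on a
single strand and pooling $(A)_n$ with $(T)_n$ gives the same picture as
counting both strands.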

Dominant polyA/T sequences were found in many DNA investigations fl3~ifl.
There are some explanations of this phenomenon, for instance the fact that the
(A,T) bond is weaker than the (C,G) one, or that $(A)_n$, $(T)_n$ sequences
have a specific three-dimensional structure different from that of $(C)_n$,
$(G)_n$ or other chains, which may be necessary for nucleosome organization
p2]. The question remains why this holds only for fragment lengths less than 7
(for the 20 complete bacterial genomes of different sizes).

For the genomes marked by (*) in Table 3, the fraction of ApT (A plus T)
nucleotides is greater than the fraction of CpG (C plus G). Only 4 other
genomes have a CpG fraction greater than the ApT one.

Below, the most frequent words of length 7 are listed for each genome, in the
order given in Table 1:

aero CCTCCTC, aful GAGGAGG, aquae AAAGGAA, bsub TTTTTCA, cpneu TTTTTCT,
ctra AGAAAAA, ecoli CGCCAGC, hinf AAAAAAT, hpyl TTTAAAA, mgen TTTTTAA,
mjan TTAAAAA, mpneu TTTTTAA, mthe TCCCTGA, mtub GCCGCCG, pabyssi TCCTGGG,
pyro TCTCCTT, rpxx TATTTTT, synecho GGCGATC, tmar GAAAGAA, tpal CGCGCGC

The dominant word shared by the most genomes at level 7 (word length equal to
7) is TTTTTAA (2 genomes).

At level 6, in the same order:

aero CTCCTC, aful GAGGAG, aquae AAGGAA, bsub TTTTTT, cpneu TTTTTT,
ctra AAAAAA, ecoli CGCCAG, hinf AAAAAT, hpyl TTTTTT, mgen AAAAAA,
mjan TTTAAA, mpneu TTTAAA, mthe CCCTGA, mtub CCGCCG, pabyssi CTTCCT,
pyro CTTCTT, rpxx ATTTTT, synecho CGATCG, tmar GAAGAA, tpal GCGCGC

This, as can be seen, differs from the results of pg| |.

For 9 of the 20 investigated genomes (aero, aful, aquae, ecoli, hinf, mthe,
mtub, rpxx, tpal) the dominant words (those with the maximal frequency of
occurrence) of length 7 differ from those of length 6 by just a single
nucleotide added at the beginning or end of the word. In 8 genomes (aero, aful,
ecoli, mthe, mtub, pabyssi, synecho, tpal) the most frequent words have a CpG
fraction greater than the ApT one. Let us note that in the dominant words the
letters C and G appear mainly in GC/CG combinations, and fragments
$(C/G)_{k>3}$ never occur there. This may be connected with the fact that CG
(or GC) repeats in DNA contribute more to the free energy of the secondary
structure than polyC/G fragments do p5[.
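The two counts above can be checked directly against the dominant words quoted
in this section. The sketch below only re-derives them from the listed words;
the variable names are hypothetical, and the genome order is the Table 1 order
stated in the text.

```python
# Short genome names in the Table 1 order.
GENOMES = ("aero aful aquae bsub cpneu ctra ecoli hinf hpyl mgen "
           "mjan mpneu mthe mtub pabyssi pyro rpxx synecho tmar tpal").split()

# Dominant words of length 7 and 6 as quoted in this section.
WORDS7 = ("CCTCCTC GAGGAGG AAAGGAA TTTTTCA TTTTTCT AGAAAAA CGCCAGC "
          "AAAAAAT TTTAAAA TTTTTAA TTAAAAA TTTTTAA TCCCTGA GCCGCCG "
          "TCCTGGG TCTCCTT TATTTTT GGCGATC GAAAGAA CGCGCGC").split()
WORDS6 = ("CTCCTC GAGGAG AAGGAA TTTTTT TTTTTT AAAAAA CGCCAG AAAAAT "
          "TTTTTT AAAAAA TTTAAA TTTAAA CCCTGA CCGCCG CTTCCT CTTCTT "
          "ATTTTT CGATCG GAAGAA GCGCGC").split()

# Genomes whose 7-letter word is the 6-letter word plus one nucleotide
# added at the beginning or at the end.
extended = [g for g, w7, w6 in zip(GENOMES, WORDS7, WORDS6)
            if w7[1:] == w6 or w7[:-1] == w6]

# Genomes whose 7-letter dominant word has more C+G than A+T letters.
gc_rich = [g for g, w7 in zip(GENOMES, WORDS7)
           if sum(c in "CG" for c in w7) > sum(c in "AT" for c in w7)]

if __name__ == "__main__":
    print(len(extended), extended)
    print(len(gc_rich), gc_rich)
```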

Since we look for every one of the $4^n$ possible words, it turns out that not
all of them occur in each genome. Namely, for n < 6 all possible words (up to
$4^5 = 1024$ of them) appear at least once in every genome. For n = 6 three
genomes lack some words: hpyl lacks TCGACA, GTCGAC; mgen lacks CTCGGA, CCGGCC,
TCGGCC, GGACGC, CGGCGC, CCCGGC, GGCCTC, GCCGTC, TCCGAG, CGCGCG, TCGGCG, GGCCGG,
CCTCGG, GGTCGG; mjan lacks GTCGAC, GCGCGC, CGATCG. For n = 7 only 4 genomes
contain all words: aero, bsub, ctra, tpal. The number of absent words for the
others varies from 1 (ecoli, synecho) to 851 (mgen). Table 4 gives the number
of absent words for the investigated genomes. One can see that the absence of
words does not follow genome length, as it should for random sequences.
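To see why the comparison with random sequences matters, one can enumerate the
absent words of a text and estimate how many words a random text of the same
length would miss. The sketch below is illustrative only. The random model is
an assumption that does not come from the paper: letters are taken as
independent and equiprobable, and overlapping windows are treated as
independent trials, so the estimate $4^n (1 - 4^{-n})^{N-n+1}$ is approximate.

```python
import math
from itertools import product


def absent_words(seq, n):
    """All words of length n over {A, C, G, T} that never occur in seq."""
    present = {seq[i:i + n] for i in range(len(seq) - n + 1)}
    all_words = ("".join(p) for p in product("ACGT", repeat=n))
    return [w for w in all_words if w not in present]


def expected_absent_uniform(genome_length, n):
    """Approximate expected number of absent words in a uniform random text.

    Assumes independent, equiprobable letters and independent windows.
    """
    words = 4 ** n
    windows = genome_length - n + 1
    return words * math.exp(windows * math.log1p(-1.0 / words))


if __name__ == "__main__":
    print(absent_words("AAAA", 1))
    # Shortest and longest genome sizes quoted in the text, n = 7.
    for size in (580074, 4639221):
        print(size, expected_absent_uniform(size, 7))
```

Under this simple model the expected number of absent 7-letter words is far
below one even for the shortest genome, and it decreases with length, whereas
the observed counts in Table 4 do not follow that ordering.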

The rarest words (those absent from the genomes under consideration more often
than others) are GTCGAC for n = 6 (2 genomes) and, for n = 7, GCGCGCG (6
genomes), followed by CGCGCGC, GTCGACG, GGCCTCG (4 genomes).

All absent and rarest words contain a greater fraction of CpG than ApT. At the
same time, neither polyC/G words nor even words containing $(C/G)_{k>3}$
fragments are absent at level 6 or rarest at level 7 in any of the investigated
genomes. In the absent and rarest words one quite often meets CG or $(CG)_k$
fragments, which, as claimed in |2f|, are more energetically favorable for
secondary structure formation. The question remains why the energetically more
favorable $(CG)_k$ fragments are present among the absent or rarest words,
while the energetically less favorable polyG/C fragments have not been found
there.

Moreover, among the absent or rarest words of length 6 there are several
(complementary) palindromes (GTCGAC, CGCGCG, GCGCGC, CGATCG). This can be
understood in a biological context, as such palindromes are well-known
restriction enzyme cut sites and hence are avoided by bacteria. Thus the
investigation of absent and rarest words also seems to be an important part of
DNA studies, since it can yield relevant information.
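A word is a complementary palindrome when it reads the same on the opposite
strand, that is, when it equals its own reverse complement. A minimal check,
with a hypothetical helper name, applied to words quoted in this section:

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def is_complementary_palindrome(word):
    """True if the word equals its reverse complement."""
    return word == "".join(COMPLEMENT[c] for c in reversed(word))


if __name__ == "__main__":
    for w in ("GTCGAC", "CGCGCG", "GCGCGC", "CGATCG", "TCGACA"):
        print(w, is_complementary_palindrome(w))
```

TCGACA, which is absent in hpyl, is included for contrast: it is not a
complementary palindrome.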

4 Autocorrelation analysis 

We have performed an autocorrelation analysis for the most frequent words,
considering lengths n = 3, 6, 7. For the analysis we use the standard procedure
of translating the genome letter sequences into a numerical representation.
Namely, we divide a genome letter sequence into words of length n, shifting the
frame (window) of length n by one nucleotide to get each new word. If the word
at the i-th position is the most frequent one for the genome under
consideration, it is replaced by 1 in the new representation ($x_i = 1$);
otherwise $x_i = 0$. So we obtain a row of N - n + 1 numerical values
$\{x_i\}_{i=1}^{N-n+1}$, where N is the genome size.

The autocorrelation function R(l) of a numerical sequence can be written as 

$$R(l) = \langle x_i x_{i+l} \rangle,$$

where the brackets denote the average over the sites along the sequence. Since
the number of units in the chain is fairly small in our case, we are interested
merely in qualitative results, in other words, in the character of R(l) itself.
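The translation into a 0/1 sequence and the autocorrelation function defined
above can be sketched as follows. The function names are hypothetical. As in
the formula, R(l) is the plain average of the products $x_i x_{i+l}$ with no
mean subtraction; averaging over the N - n + 1 - l available pairs is an
assumption of this sketch, since the normalization is not spelled out in the
text.

```python
def indicator_sequence(seq, word):
    """x_i = 1 if the window starting at position i equals word, else 0."""
    n = len(word)
    return [1 if seq[i:i + n] == word else 0
            for i in range(len(seq) - n + 1)]


def autocorrelation(x, lag):
    """R(lag) = average of x[i] * x[i + lag] over all available pairs."""
    pairs = len(x) - lag
    if pairs <= 0:
        raise ValueError("lag must be smaller than the sequence length")
    return sum(x[i] * x[i + lag] for i in range(pairs)) / pairs


if __name__ == "__main__":
    toy = "ACGACGACG"  # toy sequence with an exact period of 3
    x = indicator_sequence(toy, "ACG")
    print(x)
    print([autocorrelation(x, lag) for lag in range(1, 5)])
```

For a real genome the pure-Python loops are slow; the structure of the
calculation stays the same if the sums are vectorized.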

It has been found that for n = 3 there are almost exclusively correlations of
order 1 or 2. However, for several genomes (ecoli, mthe, mtub, tmar) period-3
oscillations are observed in the behavior of R(l).

In the case n = 6 the behavior of R(l) varies more between genomes. In the
genomes synecho, pabyssi, mthe, mpneu, hinf correlations are practically
absent; substantial ones are found in the hpyl and bsub genomes (at almost any
scale up to l = 50). Period-3 oscillations are present in the genomes mtub,
mjan, aquae, aful, aero. Let us note that for the mtub genome the correlations
are quite strong even at the scale $l \sim 10^3$. They are weak in the other
genomes with period-3 oscillations.

As for n = 7, on the whole the investigated genomes show a tendency toward
greater correlations at l that are multiples of 3. The strongest correlations
are found, as before, in the mtub genome; moreover, they exist on a very large
scale. Fig. 2 shows R(l) for this genome (n = 6). One can also mention that the
tpal genome has correlations of order 2 and 4, mjan of order 12 and 21, hpyl of
order 10, 15, 21, 39, 45, and aero and aful of order 3.
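One simple way to quantify a tendency toward larger correlations at multiples
of 3 is to compare the mean of R(l) over lags divisible by 3 with the mean over
the remaining lags. This contrast measure and its name are assumptions made
here for illustration; the text itself relies on visual inspection of R(l).

```python
def period3_contrast(r_values):
    """Mean R(l) over l divisible by 3 minus mean R(l) over other lags.

    r_values[l - 1] must hold R(l) for l = 1, 2, ..., L with L >= 3.
    A clearly positive value suggests period-3 oscillations.
    """
    on = [r for lag, r in enumerate(r_values, start=1) if lag % 3 == 0]
    off = [r for lag, r in enumerate(r_values, start=1) if lag % 3 != 0]
    if not on or not off:
        raise ValueError("need values for at least lags 1 to 3")
    return sum(on) / len(on) - sum(off) / len(off)


if __name__ == "__main__":
    # Hypothetical R(1..6) values for illustration only.
    print(period3_contrast([0.0, 0.0, 0.5, 0.0, 0.0, 0.5]))
```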

A more detailed analysis of the structure of the mtub genome reveals that
period-3 oscillations are characteristic of the second and third words in the
ranked word distribution as well; for the fourth word this is not so. For 6 of
the 20 genomes under consideration (for n = 7) the most frequent words have
frequencies of occurrence an order of magnitude higher. Three of them (hpyl,
mjan, mtub) reveal correlations. In the mtub genome the frequency of the first
three words is of order 3, and period-3 oscillations are characteristic of the
first three words as well. Synecho has the sharpest drop in the ranked word
distribution after the fourth word (Fig. 3).

Such a characteristic tendency toward period-3 oscillations, both in letters
and in words of different lengths in DNA texts, may be connected with scale
invariance or self-affinity of genome organization; in other words, DNA
sequences to all appearances possess fractal properties.

5 Conclusions 

On the basis of the statistical analysis of the bacterial genomes, the main
conclusions are as follows.

The ranked word distributions are well approximated by a logarithmic law.

Results obtained in the absent-word investigation show the considerably
nonrandom character of DNA text sequences and reveal biologically relevant
units such as restriction enzyme cut sites. This points to the importance of
this kind of study.

The characteristics do not depend on genome size, as they would have to for
random texts.

In several genomes the autocorrelation function shows period-3 oscillations.
This result holds for every word length considered.

Short-range autocorrelations are present for short (n = 3) words and
practically absent for longer words.

Concerning the autocorrelation investigation, the results obtained for the mtub
genome seem to be the most interesting. Here we have strong correlations with
period-3 oscillations for every word length considered and on a large scale,
which could not be detected for other genomes.

On the whole, the statistical analysis shows that bacterial genomes vary
considerably from each other. No essential similarities were found for genomes
of the same class (e.g. Pyrococcus: pabyssi, pyro; Chlamydia: cpneu, ctra;
Mycoplasma: mpneu, mgen).

If we want to elaborate a general scheme of genome classification based on
statistical analysis, it will be a fairly difficult task, since there are
always many exceptions. For example, words with CG repeats can be both the most
frequent words and words that never occur. GC fragments in the dominant words
are more typical of the longer genomes (ecoli, mtub, synecho), but the bsub
genome is longer than synecho and yet does not contain such fragments in its
dominant word. Reasonable conclusions can be made only on the basis of as large
a set of factors as possible. So one can suppose that the absence of words at
level 6 in the hpyl and mjan genomes (which have medium length among those
investigated) is connected with the presence of autocorrelations of the most
frequent words on more than a single scale. Strong autocorrelations in the mtub
genome may point to a specific structure of this genome. Here one also has to
mention the atypical characteristics of the tpal genome: the smallest index in
the ranked word distribution, autocorrelations on scales 2 and 4, a dominant
word consisting of CG repeats (the rarest word for other genomes), and the
presence of all possible words at level 7. All these factors allow us to claim
that this genome is the most structureless of those investigated.
FIGURES

Fig. 1 Ranked word distributions for the tpal, mgen and aquae genomes on a
semilogarithmic scale. Word length n = 7.

Fig. 2 Autocorrelation function R(l) for the mtub genome on a semilogarithmic
scale, word length n = 6.

Fig. 3 The first 20 points of the ranked word distributions for the hpyl, mgen,
mjan, mtub, rpxx and synecho genomes, which are characterized by frequencies of
the initial words greater by an order of magnitude, and for the aero genome for
comparison.






TABLES 

Table 1

Size (bp)   Short name   Full name
1669695     aero         Aeropyrum pernix K1
2178400     aful         Archaeoglobus fulgidus
1551335     aquae        Aquifex aeolicus
4214814     bsub         Bacillus subtilis
1230230     cpneu        Chlamydia pneumoniae
1042518     ctra         Chlamydia trachomatis
4639221     ecoli        Escherichia coli K-12 MG1655
1830137     hinf         Haemophilus influenzae Rd
1667867     hpyl         Helicobacter pylori 26695
580074      mgen         Mycoplasma genitalium G37
1664970     mjan         Methanococcus jannaschii
816394      mpneu        Mycoplasma pneumoniae M129
1751377     mthe         Methanobacterium thermoautotrophicum delta H
4411529     mtub         Mycobacterium tuberculosis
1765118     pabyssi      Pyrococcus abyssi
1738505     pyro         Pyrococcus horikoshii OT3
1111523     rpxx         Rickettsia prowazekii strain Madrid E
3573470     synecho      Synechocystis PCC6803
1860725     tmar         Thermotoga maritima
1138011     tpal         Treponema pallidum

Sizes, short names (used in the paper) and full names of the 20 investigated
bacterial genomes.
Table 2

name      n=5      n=6     n=7
aero      6.12     1.71    0.49
aful      5.92     1.62    0.446
aquae     6.53     1.96    0.581
bsub      5.87     1.6     0.452
cpneu     6.27     1.67    0.47
ctra      6.08     1.62    0.454
ecoli     5.71     1.47    0.39
hinf      7.11     2.04    0.595
hpyl      7.73     2.26    0.717
mgen      9.73     2.79m   0.86
mjan      10.04m   2.78    0.857
mpneu     6.94     1.97    0.572
mthe      6.43     1.7     0.468
mtub      8.6      2.44    0.725
pabyssi   6.25     1.62    0.438
pyro      7.39     1.87    0.497
rpxx      9.21     2.65    0.84
synecho   6.14     1.62    0.448
tmar      6.52     1.84    0.536
tpal      4.49m    1.25m   0.356m

Indexes a for each genome under consideration and word length $n \in [5, 7]$.
Table 3

n   Ratio   Genomes
2   14/20   aero, ecoli, mthe*, mtub, tmar*, tpal
3   12/20   + aful, pabyssi
4   11/20   + pyro
5   10/20   + aquae
6   5/20    + hinf, mjan, mpneu, rpxx, synecho
7   0/20    all

Occurrence of polyA/T sequences for $n \in [2, 7]$ and the genomes whose most
frequent word is not $(A/T)_n$.

The first column shows the level number (word length); the second column, the
ratio of the number of genomes with a dominant polyA/T sequence to the total
number of genomes; the third column, the names of the genomes whose most
frequent word is not $(A/T)_n$.

(The table lists the genomes whose most frequent words are other than polyA/T
because they are fewer for n = 2, and because a genome that has no dominant
polyA/T sequence at the second level (n = 2) has none at any higher level. For
n = 3 we only add to the genomes of the previous level those that have a
dominant polyA/T at the second level but not at the third, and so on up to
level 7.)
Table 4

name      Absent words (n = 7)
aero      0
aful      4
aquae     4
bsub      0
cpneu     2
ctra      0
ecoli     1
hinf      11
hpyl      192
mgen      851
mjan      318
mpneu     7
mthe      5
mtub      3
pabyssi   3
pyro      4
rpxx      71
synecho   1
tmar      2
tpal      0

Number of absent words for the genomes under consideration.
Fig. 1